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A direct calculation of the complex AI = 1/2 kaon decay amplitude is notoriously difficult be- 
cause of the presence of disconnected graphs. Here we describe and demonstrate two practical 
methods to defeat this problem: the EigCG algorithm and the use of time-separated % — n sources. 
With a fine tuned EigCG implementation for domain wall fermions, the calculation of light quark 
propagators is accelerated by a factor of 5-10 on a variety of lattices from small (16 3 x 32 x 16) 
to large (32 3 x 64 x 32). In addition, a substantial reduction in noise is achieved by separating 
each of the sources for the two pions in the time direction by 2-5 lattice spacings. These methods 
are combined in a calculation of K to kk threshold decay using a 24 3 x 64 x 16 volume and 329 
MeV pions. These methods result in non-zero signals for both Re(Ao) and Im(Ao) from 138 gauge 
configurations. 



XXIX International Symposium on Lattice Field Theory 
July 10-16, 2011 

Squaw Valley, Lake Tahoe, California 



* Speaker. 



© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. 



http://pos.sissa.it/ 



Methods for calculation ofK to %% Decay 



Qi Liu 



1. Introduction 

To qualitatively understand the experiment phenomena of the AI = 1/2 enhancement rule and 
the direct CP violation in the neutral kaon decay process, a direct calculation of the K° — > nn weak 
matrix elements is needed. This includes the calculation of disconnected graphs and therefore 
requires very large statistics. Our previous systematic study of a full first principle calculation of 
kaon decay on a 16 3 x 32 x 16 volume lattice with Domain Wall Fermion (DWF) showed promising 
results [P and encouraged us to try to approach physical kinematics on a larger volume. Our 
ultimate goal is to calculate the AI = 1/2 decay amplitude with all physical parameters just as 
we have done for the AI = 3/2 decay [gj. Then we can compare the lattice results directly with 
experiment, which will provide us a deeper understanding of the AI = 1/2 rule and a check of the 
fundamental mechanism of CP violation of the standard model. 

While a full calculation with physical kinematics is still out of reach, we extended our previous 
calculation to a larger lattice with volume 24 3 x 64 x 16, and decreased the pion mass from the 
previous 420 MeV to 330 MeV. A simple estimation of the volume effect and the number of CG 
iterations shows that this calculation for each configuration is 27 times more difficult than the 
previous one. Therefore, in this work, we concentrate on techniques to reduce the difficulty of such 
a direct calculation. In the following, two techniques will be discussed in detail, first the EigCG 
Algorithm and then the method with time-separated % — % sources. At the end, we will present our 
latest results for both Aq and A2 from the larger lattice with both techniques incorporated. 

2. Setup for the K° to %% decay calculation 

The effective weak Hamiltonian for the K° to %% decay including 2+ 1 flavors is 

G 10 

H eff = -^v: d V us ^[( Zi (u) + zyi(ji))]Qi. (2.1) 
V2 , =1 

where z, and y, are the Wilson coefficients, <2, are the ten four-fermion operators. For more details 
about the effective weak Hamiltonian, the calculation of Wilson coefficients, and the definition of 
the four-fermion operators, see ref. [|p. To obtain the decay amplitudes, we need to calculate the 
weak matrix element < nn\Qi\K° > for each of the ten operators on the lattice, then convert them 
to the MS scheme, and finally combine with the Wilson coefficients which are also calculated in the 
MS scheme. As we have describe this in detail in the Appendix A of [j|], the conversion from the 
lattice operators into the MS scheme involves two steps. First we convert it into RI/MOM scheme, 
and then convert the RI operators to the MS scheme. 

The most important part of this work is to calculate the weak matrix elements on the lattice 
Q\ at . There are four types of contraction as shown in Figure[T| The detailed structures of the 
different kinds of spin and color contractions for each type and related subtraction graphs are 
discussed in our previous work [@]. 

We use a Coulomb Gauge fixed wall source and sink for the pions and kaons. Because of the 
presence of the disconnected graph (type 4), we are required to be able to put the sources on all 
possible time slices. Therefore, T (the dimension in the time direction) propagators with a wall 
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Figure 1: The four types of contractions that contributes to the calculation of K° to two pions decay. The 
graph circle stands for one of the four-fermion operators, the lines indicate the propagators(with addition 
label s meaning strange quark and otherwise light quark), and the black dot stands for the kaon or the pion 
with a 75 matrix insertion. The type 4 graph is the disconnected graph. 

source for both light and strange quarks are calculated. Even though computationally very expen- 
sive, it gives us the freedom to translate the position of the kaon, the operator and the two pions 
simultaneously, thus effectively increasing the statistics from a given configuration. In addition to 
the Coulomb gauge wall source propagators, we also calculate T random wall source propagators 
to estimate the loop shown in the type 3 and 4 contractions by stochastic method. In total, we need 
to solve 2T light quark propagators: on a typical T=64 lattice, this is equivalent to 1536 Dirac op- 
erator solves. At this point, it is clear to us that a good algorithm to speed up propagator calculation 
is crucial for such a calculation to be manageable. 

3. The EigCG Algorithm 

There are two recently published algorithms for the calculation of propagators that could po- 
tentially provide a factor of 5-10 speed up. The first one is Liischer's inexact low modes defla- 
tion algorithm with the domain-decomposed subspaces that are based on the property called local 
coherence of the low modes [Q]. The second one is the EigCG algorithm by Stathopoulos and 
Orginos []5j|. With the inexact low modes deflation method, we obtained a big factor of improve- 
ment with a 16 3 x 32 x 8 lattice on a single node machine. However, it turns out to be very difficult 
to implement effectively for a highly parallel machine because of the complex structure of the little 
Dirac operator in the case of domain wall fermions. The Dirac operator for DWF Dj w f is not posi- 
tively definite, so the operator we solve has to be D^^Ddwf, the resulting little Dirac operator has 
many hopping terms and it is very ineffective to calculate its inverse. In comparison, the EigCG 
algorithm only requires a few linear algebra operations and can easily adapt to massively parallel 
machine no matter what the operators are. So we used it in our calculation. The disadvantage of 
the EigCG algorithm compared to Liischer's is the huge requirement of memory. Nevertheless, our 
current machine has sufficient memory even for the largest lattice we are currently working on, so 
it is not a serious issue. 
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We follow very closely the original work of EigCG in [Q]. Our goal is to solve Ax = b fast for 
many right hand side vectors b. Here, the operator A we consider is the even odd preconditioned 
DWF operator A = D^^D^ W ~. The EigCG algorithm works as follows: it accumulates many low 
modes during each normal CG solve; for each new solve, it projects out the low mode space that 
the EigCG algorithm already accumulated by an initial solution 

X Q = U{U 1 AU)- l U 1 b (3.1) 

where U projects onto low mode space spanned by the low mode vectors. A typical convergence 
behavior with EigCG is shown in Figure [2[ The first solve is exactly the same as the normal CG 
algorithm. The second solve becomes a little bit faster because of the initial projection of the low 
modes that we already accumulated during the first solve. Gradually, the new solves become faster 
and faster with more and more low modes available. Finally we will stop accumulating low modes 
and simply do projections to speed up the calculation. 

However, there is a clear turning point (around res ~ 1(T 6 ) on the convergence curve for the 
sped-up solves. It dramatically slows down to the normal CG speed at some point. This is because 
of the inaccuracy of the low modes we obtained from each CG solve. Typically, we try to obtain 
roughly 16 low modes from each solve and throw away a few with eigenvalue larger than some 
threshold. It is therefore impossible to get all these 16 low modes very accurately. The strategy to 
avoid the slow down with the low accuracy low modes is to do multiple projections by restarting 
the CG algorithm using the residual of the previous inversion attempt as the new right hand side. 
Following the initial projection as shown in Eq {3.1[ , we do a few more projections in the middle 
of the solving process. For example, suppose that after n iterations the relative residual reduces 
to 10~ 5 , with solution x n and residual r„ = b—Ax n , we can restart the CG with initial solution 
*o = x n + U (U^AU)~ l U^r n on the equation Ax? = r n . As shown in figure § the relative residual 
goes straight down once a restart point at 10~ 5 is introduced. 

There are two things worth a notice. First, during the low mode accumulation stage, we 
prefer not to do multiple restart for the solves since it may affect the efficiency of the low modes 
accumulation. This is the reason that there are turning points in the first 60 convergence curve 
(except the one goes straight down in 1500 iterations) in figure ||. Second, if the low modes are 
extremely inaccurate, we have to do many projections by restarting the CG algorithm. In the worst 
case, we may need to do one projection after each CG step. Then we could better incorporate the 
projection operator in the original operator to perform a so called oblique projection as Luscher's 
algorithm does [Qj. On the other hand, for each restart of the CG, we lose all previous information 
about the direction vectors of the CG algorithm (which is the advantage of the CG to the steepest 
descent algorithm), so it leads to a decrease of efficiency of the CG algorithm. Therefore, it is 
better to do fewer restart, only when it is necessary. 

We have shown in figure ^| that we could successfully apply EigCG to a 32 3 x 64 x 32 lattice, 
and gain a factor of 7 speedup. The number of low modes we accumulate, the required memory to 
achieve this, and the comparison of the number of iterations to the original CG is summarized in 
table |[ Notice that to reduce the memory requirement, we used single precision to store the low 
modes. This has no negative effect on the EigCG algorithm since the low modes we obtained are 
not very accurate any way. The largest lattice we tested (32 3 x 64 x 32) requires 2 Tbytes memory, 
the code runs efficiently on 4k BGL nodes, which provides 4 Tbytes memory. 
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Figure 2: Relative residual versus number of iterations using EigCG on a 32 3 x 64 x 32 DSDR lattice. From 
the first 5 propagator solves (60 Dirac solves), the algorithms accumulate more and more low modes. After 
that, all new solves converge to 10~ 8 in roughtly 1500 iterations, by using one restart at 10 -5 . 



Table 1: The speedup from EigCG algorithm on different lattices. N prop stands for the number of propagator 
solves to get the required number of low modes Ni, nv . The symbol * means that it is a quenched calculation. 



Lattice 


m% 


CG 


Nlow(Np r0 p) 


Total Memory 


EigCG 


speed up 


16 3 x 32 x 16 


421 MeV 


1840 


120(1) 


12 GB 


370 


5.0 


16 3 x 32 x 16 


204* MeV 


3200 


120(1) 


12 GB 


460 


7.0 


24 3 x 64 x 16 


330 MeV 


2900 


400(4) 


272 GB 


530 


5.5 


32 3 x 64 x 32 


180 MeV 


10400 


600(5) 


2 TB 


1480 


7.0 



4. Time separated n — n source 

We separate the two pion sources in the time direction by 8 (figure ||) , therefore reducing 
the correlation between the two pion sources. It can dramatically reduce the vacuum noise from 
the disconnected graph. For example, the error on the isospin zero tz — tz energy is reduced from 
0.0126 to 0.0055 by introducing a separation of 4 between the two pions. As shown in figure |], 
the effective mass plateau also begins earlier, even though we still use a fixed fitting range 5 to 15, 
inclusive. 

5. K° to tztz decay amplitudes and conclusion 

Using the techniques we have mentioned, we performed a threshold = 2m 7[ , K — > TZTZ 
calculation on a Nf = 2 + 1 flavor 24 3 x 64 x 16 lattice with DWF, Iwasaki gauge action, cT x = 
1.729(30) GeV, and a 330 MeV pion mass. The EigCG algorithm speeds up the calculation by a 
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Figure 3: Separating the two pion sources in the time direction. The left panel shows the setup for the %—n 
scattering calculation, and the right panel shows the setup for the k — > nn decay calculation. 
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Figure 4: Effective mass plot for the two pions in the isospin zero channel. The left one uses K — % separation 
0, and the right one uses 4. The energy calculated from these two setups is 0.3922(126) and 0.3639(55) 
respectively. 



factor of 5, and introducing a separation between the two pion sources by 4 makes the signal much 
better. 

Once we calculate the correlation functions, we do a single parameter fit to find the weak 
matrix elements, 

{OK{0)Qi{to P )Onn^A + 8)) = l/2M-(m K -E^, f g n 

where the kaon energy and tc — 71 energy are fitted from the kaon and Tin correlation functions. 
Results for operator Q2 which makes a major contribution to Z?e(A ) and the operator <26 which 
makes a major contribution to Im(Ao) are shown in figure |5[ A summary of the final results obtained 
by combing NPR and Wilson coefficients are shown in table BL This calculation is performed on 
138 configurations. 

In summary, we performed a full first principle calculation for both A2 and Aq in a 2.7 fm box, 
with a 660 MeV kaon decaying to two 330 MeV pions. The agreement of the results with and 
without disconnected graphs indicats that the diconnected graphs may not play a crucial role in this 
particular decay process. A ratio of 12.0(1.7) for Re(Ao) to Re{A2) suggests already a dramatic 
A/ = 1/2 rule effect. The direct CP violation measure Re(e'/e) is calculated to be 2.0(1.7) x 10~ 3 
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Figure 5: The weak matrix element for < 7t7tj = o\Q2\K° > (left) and < 7Z7Z I=0 \Q 6 \K° > (right). The x-axis 



represents the position of the operator relative the the kaon, and y-axis is the amplitude defined in Eq. 5.1 
The ' symbol represents the result without the disconnected graph. We used A = 16, and 5=4 here. 



for these unphysical kinematics. In the future, we are going to collect more statistics to resolve a 
clear signal for e' and then move to a calculation with physical kinematics. 

Table 2: K° — > %n decay amplitudes for a threshold calculation with w 2m n . The unit for Real part 
is x 1CT 8 GeV, and Imaginary part is x 10~ 12 GeV. The symbol ' indicates that the disconnected graphs are 
ignored. 



m^(MeV) m K (Me,V) 


Re(A ) 


Re(A' ) 


Im(A ) 


Im(A' ) 


Re(A 2 ) 


Im(A 2 ) 


329.3 662.1 


31.1(4.5) 


27.8(0.8) 


-33(15) 


-36.3(16) 


2.668(14) 


-0.6509(34) 
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